Data Science for Business Applications
Exam Project: Features for Creating a Successful Business - An Analysis on Yelp Reviews
- Executive Summary
- Introduction to the Data
- Preprocessing
- Market delimitation
- Merge users with the Phoenix data set
- Import JSON in chunk
- Import phoenix data set
- Iterate through users and merge with phoenix data set
- Create experience column, based on review stars
- Visualize distribution of columns before removing outliers
- Remove outliers
- Create final data set containing businesses in Phoenix and their respective reviews, along with useful information about the user who gave each review. "Outliers" are removed from the data set.
- Visualize distribution of columns after removing outliers
- Analysis
Executive Summary
This section elaborates on the relevance of our analysis and methods utilized in this paper for business-decision purposes.
The main purpose of this paper is to allow a chosen category of businesses, in a specific area, to filter relevant reviews and, based on these, make better-informed choices. In this paper we discover the most popular categories and cities that have been reviewed. Based on that, we chose to delve into Phoenix, a city with relatively high demand (reviews) and low market saturation. We conduct a thorough sentiment analysis on the negative and positive reviews, which allows businesses to grasp an understanding of what their customers are satisfied and dissatisfied with. This entails a comprehension of which facets to improve in the various nightlife entities. By also identifying the customers that tend to be extremely positive or negative, we exclude reviews that are too extreme in order to reduce bias in our project. Specifically, when analysing the nightlife in Phoenix, we identify the most popular nightlife categories, which draws a picture of what new entrants should consider opening in order to increase their chance of success at launch. Finally, we apply topic modeling to the review texts to surface the themes that set the best performing businesses apart from the worst performing ones.
In this project, we have chosen to work with an open dataset provided by Yelp (you can download and read more about the data set here).
Yelp is an online review platform that enables people to find user recommendations for a wide range of businesses - and to write reviews of their own experiences. To give a review, a user is asked to give a star rating on a scale from 1 to 5 and include a description of the experience the user had with the entity.
This dataset was made available for personal, educational and academic purposes. It is a subset consisting of reviews, businesses and users across 10 metropolitan areas. More specifically, it contains data on more than 8 million reviews, 200,000 businesses, and almost 2 million users. The dataset is downloaded as a compressed tar file. Uncompressed, it consists of 5 json-files and a total of 9.8 GB of data - a rather large dataset.
From a data science standpoint this dataset is rather interesting, as it contains a vast amount of data, including both structured and unstructured data (text). For more demanding projects, Yelp has made an additional dataset available containing 200,000 photos that have been posted on their platform in connection with the reviews. These photos, however, are not included in this project.
Problem Statement
After a thorough inspection of the documentation, we decided to use three main data sets: Business.json, review.json and user.json. Thus, check-ins and tips (shorter reviews) have been excluded from the analysis.
We found these datasets interesting for our project as they contain text reviews and additional related data about the users who write the reviews and the businesses being reviewed. This allows us to extract features from the data to gain important and actionable insights about the businesses and the markets they operate in.
Inspired by House of Cards (How Netflix Used Data Science to Create One of the Most Loved Shows Ever: House of Cards) we came up with an interesting subject for analysis - our problem statement:
Description of the Datasets
A description of the dataset is listed in the documentation. Here, we provide you with a simplified description of the datasets:
Business.json
Contains business data including location data, attributes and categories:
- business_id: ID of the business
- name: name of the business
- address: address of the business
- city: city of the business
- state: state of the business
- postal_code: postal code of the business
- latitude: latitude of the business
- longitude: longitude of the business
- stars: average rating of the business
- review_count: number of reviews received
- is_open: 1 if the business is open, 0 if the business has closed down
- attributes: business attributes and their values
- categories: multiple categories of the business
- hours: business opening hours
Review.json
Contains full review text data including the user_id that wrote the review and the business_id the review was written for:
- review_id: ID of the review
- user_id: ID of the user
- business_id: ID of the business
- stars: stars given in review
- date: time of review
- text: the review
- useful: number of people that found the review useful
- funny: number of people that found the review funny
- cool: number of people that found the review cool
User.json
Contains user data including the user's friend mapping and all the metadata associated with the user.
- user_id: ID of the user
- name: name of the user
- review_count: number of reviews given
- yelping_since: user creation date
- friends: the user’s friends as an array of user_ids
- useful: number of useful votes sent by the user
- funny: number of funny votes sent by the user
- cool: number of cool votes sent by the user
- fans: number of fans the user has
- elite: years the user had elite status as an array.
- average_stars: average star rating given by the user
- compliments: number of times the user was complimented.
Methodology
Steps of Analysis
To answer the question of interest, we will perform an exploratory data analysis (EDA) to gain an understanding of the business context and of which factors make a business successful or not. Here, we will investigate which business categories the users are most interested in and where the demand is located. Furthermore, we explore which businesses have been successful in attracting customers and how well they succeed in satisfying the consumers' needs. Finally, for the data exploration, we investigate the correlation between customer satisfaction and whether the business is still in business or not.
From the EDA, a few businesses of interest will be further investigated by applying topic modeling to understand which features in the reviews set the best performing businesses apart from the lowest performing businesses. These insights can help future businesses design a better business model and know what the customers' main success criteria and complaints are.
Moreover, we will apply a supervised machine learning model on top of the topic modeling to enable businesses to predict the outcome of different narratives. This will enable businesses to delve deeper into their business design and make decisions on which features they should include in order to provide the best possible customer experience. In other words, this model enables businesses to design their business around the narratives they would like the customers to experience - a very powerful tool.
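As a preview of the kind of supervised model we have in mind, here is a minimal sketch using CountVectorizer and MultinomialNB. The review snippets and labels below are purely illustrative, not taken from the Yelp data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up review snippets (NOT real Yelp reviews)
texts = ["great cocktails and friendly staff",
         "terrible service and long wait",
         "amazing bar, great music",
         "rude staff, bad experience"]
labels = [1, 0, 1, 0]  # 1 = positive experience, 0 = negative

# Turn the texts into a document-term matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Fit a simple Naive Bayes classifier on the counts
model = MultinomialNB().fit(X, labels)

# Predict the experience a new "narrative" would likely produce
new_text = vectorizer.transform(["friendly staff and great music"])
pred = model.predict(new_text)[0]
print(pred)  # 1: the words lean heavily toward the positive class
```

The same pipeline scales directly to the real review texts; only the input corpus and labels change.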
Preprocessing Steps
To prepare our data for analysis, we will start by delimiting the data to a business category and location of interest. We do this as we assume that customer demand may vary across different locations and categories.
Moving on, we will look at the users and identify those that tend to be too negative or positive in their reviews. This is an important step, as some users might be “notorious complainers” and thus might skew the results.
Similarly, the number of reviews assigned to a business is an important aspect to consider - the less data, the less representative it is. Such businesses would therefore produce unreliable results. To avoid this problem, we will filter out businesses that fail to exceed a minimum number of reviews.
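This filtering step amounts to a simple threshold on the review count. A sketch on a toy table (the threshold and the values are hypothetical):

```python
import pandas as pd

# Toy business table (hypothetical values)
businesses = pd.DataFrame({
    'business_id': ['a', 'b', 'c'],
    'review_count': [3, 120, 45],
})

MIN_REVIEWS = 10  # hypothetical cutoff; the project picks its own later
reliable = businesses[businesses.review_count >= MIN_REVIEWS]
print(len(reliable))  # business 'a' is dropped, 'b' and 'c' remain
```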
Limitation:
We only read in 1,000,000 lines of data from review.json as well as from user.json. Those two data sets were rather large, and our computers lacked the memory to work with them in full. This means we “lost” some information; however, we judged that this would not interfere with our ability to demonstrate our proficiency with the newly acquired methods and tools (data preprocessing, EDA, topic modeling, SML, etc.). Eventually, our “final” data set (phoenix_nightlife.csv) reached a size of 1 GB, so we had plenty of data to work with.
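The capped, chunked reading pattern can be sketched as follows, using an in-memory stream and tiny numbers in place of the real file and the 1,000,000-line budget:

```python
import json
from io import StringIO
import pandas as pd

# Simulate a small line-delimited JSON file (3 records)
raw = "\n".join(json.dumps({"review_id": i, "stars": s})
                for i, s in enumerate([5, 3, 1]))

MAX_LINES = 2  # stands in for the 1,000,000-line budget
CHUNKSIZE = 1  # stands in for the real chunk size

chunks = []
reader = pd.read_json(StringIO(raw), lines=True, chunksize=CHUNKSIZE)
for chunk in reader:
    chunks.append(chunk)
    if sum(len(c) for c in chunks) >= MAX_LINES:
        break  # stop reading once the line budget is spent

df = pd.concat(chunks, ignore_index=True)
print(len(df))  # 2 of the 3 records were read
```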
Installation & Libraries
As usual, before we can initialize the preprocessing steps, we will need a repository of libraries to work with the data.
We have provided you with interactive buttons throughout this page that you can use to display or hide certain code cells of interest. As such, we have provided a list of the installations and libraries below.
We wish you happy reading!
#Installations
!pip install wordcloud
#Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud
from sklearn import metrics
from sklearn import decomposition
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import sklearn.utils as utils
from sklearn.naive_bayes import MultinomialNB
import joblib as joblib
from google.colab import files #Use to upload single files from local directory
from google.colab import drive #Use to mount google drive to access documents saved in the cloud
Delimiting the dataset
To work efficiently with the data, we have created a Google Drive folder and mounted it in Colab. We have filled the drive with all the relevant json-files, and it will also be used to host future data sets created throughout this project. You can view the Drive here.
In this section we will delimit the area of research to a specific business category and area of interest. This allows us to dig deeper into the data and compare businesses that operate within the same space.
We will start by identifying which business categories are the most popular. Later, we will look at the businesses' locations to identify which location is the most interesting for a new business to enter. Here, popularity is defined as the number of businesses working within the same category, while the most interesting locations are defined as those with the highest demand compared to the current supply (the locations with the lowest market saturation). It is assumed that the barrier to entry for any given business is lowest when market saturation is similarly low. Thus, this provides a great indicator of where to invest in a new business.
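The market-saturation metric defined above can be sketched on hypothetical numbers:

```python
import pandas as pd

# Hypothetical per-city counts
df = pd.DataFrame({
    'city': ['A', 'B'],
    'supply': [50, 20],      # number of nightlife businesses listed
    'demand': [10000, 8000], # number of reviews given
}).set_index('city')

# Saturation: businesses per 100 reviews -- lower means less competition
df['market_saturation'] = df.supply / df.demand * 100

print(df.market_saturation.idxmin())  # 'B' is the less saturated market
```

City A scores 0.5 and city B scores 0.25, so under this metric B would be the more attractive market to enter.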
business_json_path = '/content/drive/MyDrive/DSBA_Project/yelp_dataset/yelp_academic_dataset_business.json'
business = pd.read_json(business_json_path, lines=True)
business.head()
From the data we see that each business is assigned to multiple categories. For instance, Felinus works within three different categories: 'pets, pet services and pet groomers'.
To identify the most popular categories, each business is split into multiple rows, each containing a single category. Thus, Felinus will be split into 3 rows. Finally, the categories are counted to display the ones that appear most often in the business data:
It is important to note that some businesses have closed down and are no longer in operation. We might not want to include these when trying to find the most popular business categories.
business_open = business[business.is_open == 1] #Filter out businesses that have closed
business_explode = business_open.assign(categories = business_open.categories.str.split(', ')).explode('categories') #Split categories into multiple rows
business_popular_cat = business_explode.categories.value_counts() #Count number of times a category appears in the data
business_popular_cat = pd.DataFrame(business_popular_cat)
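The split-and-explode step above can be illustrated on a toy frame with a single business:

```python
import pandas as pd

# One business with three categories in a single comma-separated string
business = pd.DataFrame({
    'name': ['Felinus'],
    'categories': ['Pets, Pet Services, Pet Groomers'],
})

# Split the string into a list, then explode the list into one row per category
exploded = (business
            .assign(categories=business.categories.str.split(', '))
            .explode('categories'))

print(len(exploded))  # 3 rows, one per category
```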
#Visualization of the 20 most popular business categories
plt.figure(figsize= (25, 15))
plt.style.use('ggplot')
sns.barplot(x= business_popular_cat.categories[:20], y= business_popular_cat.index[:20])
plt.title('Top 20 categories of businesses', fontdict= {'fontsize': 24})
plt.xlabel('Number of businesses', fontdict= {'fontsize': 18})
plt.tick_params(labelsize= 16)
plt.savefig('categories_top_20.png')
Chosen Category of Interest
As seen in the barplot, more than 40,000 restaurants are listed on Yelp's platform, making restaurants by far the most popular business category. This could be a clear indication of a business category with fierce competition, but it could also prove a great subject of analysis, as this business segment might need a thorough understanding of the customers' needs in order to successfully position itself in the market, more than any other business.
Even so, we delimit our analysis to the 9th most commonly listed category on Yelp: the Nightlife category. We found this category particularly interesting to analyse as we believe nightlife is a very complex business area, where the lines that define a successful business are quite blurred and washed out. Our hope is that, through our analysis, we can gain a better understanding of which features are important for giving customers a great experience, and help new businesses create successful business concepts.
business_nightlife = business[business['categories'].str.contains('Nightlife', case= False, na= False)]
reviews_json_path = 'D:/projects/NLP project DS/yelp_academic_dataset_review.json'
size = 1000000
review = pd.read_json(reviews_json_path, lines=True,
dtype={'review_id':str,'user_id':str,
'business_id':str,'stars':int,
'date':str,'text':str,'useful':int,
'funny':int,'cool':int},
chunksize=size)
If you read in chunks, read_json() returns a JsonReader object for iteration.
chunk_list = []
for chunk_review in review:
    chunk_review = chunk_review.drop(['review_id','useful','funny','cool'], axis= 1)
    chunk_review = chunk_review.rename(columns= {'stars': 'review_stars'})
    chunk_merged = pd.merge(business_nightlife, chunk_review, on= 'business_id', how= 'inner')
    print(f"{chunk_merged.shape[0]} out of {size:,} related reviews")
    chunk_list.append(chunk_merged)
business_nightlife_review = pd.concat(chunk_list, ignore_index=True, join='outer', axis=0)
business_nightlife_review.head()
business_nightlife_review.sample(5)
business_nightlife_review.shape
name = 'business_nightlife_review.csv'
business_nightlife_review.to_csv(name, index= False)
df = pd.read_csv('data/nightlife_review.csv')
df.shape
Look at the top of the data set
df.head(10)
df.info()
supply = pd.DataFrame(df.groupby('city').name.nunique().sort_values(ascending= False))
supply.columns = ['supply']
supply.head()
demand = pd.DataFrame(df.groupby('city').size().sort_values(ascending= False))
demand.columns = ['demand']
demand.head()
df_dem_sup = supply.merge(demand, on= 'city')
df_dem_sup.head()
df_dem_sup['market_saturation'] = df_dem_sup.supply / df_dem_sup.demand * 100
Keep only cities with more than 5,000 reviews, then sort the entries by market saturation (ascending)
df_dem_sup = df_dem_sup[(df_dem_sup.demand > 5000)]
df_dem_sup = df_dem_sup.sort_values(by= 'market_saturation')
df_dem_sup.shape
df_dem_sup.head(20)
plt.figure(figsize= (25, 15))
plt.title('20 LOWEST SATURATED MARKETS (LOWER IS BETTER)', fontdict= {'fontsize': 24})
sns.barplot(x= df_dem_sup.market_saturation[:20], y= df_dem_sup.index[:20])
plt.ylabel('cities', fontdict= {'fontsize': 18})
plt.xlabel('market saturation', rotation= 0, fontdict= {'fontsize': 18})
plt.tick_params(labelsize= 16)
plt.savefig('market_saturation.png')
Phoenix is the chosen city, as the market saturation is relatively low and there is a rather high demand present in the city.
phoenix = df[(df.city == 'Phoenix')]
phoenix.head()
phoenix.shape
phoenix.isnull().sum()
name = 'data/phoenix_nightlife.csv'
phoenix.to_csv(name, index= False)
path = 'data/yelp_academic_dataset_user.json'
size = 1000000
users = pd.read_json(path, lines= True,
dtype= {'user_id': str, 'name': str, 'review_count': int, 'yelping_since': str, 'friends': list,
'useful': int, 'funny': int, 'cool': int, 'fans': int, 'elite': list, 'average_stars': float,
'compliment_hot': int, 'compliment_more': int, 'compliment_profile': int, 'compliment_cute': int,
'compliment_list': int, 'compliment_note': int, 'compliment_plain': int, 'compliment_cool': int,
'compliment_funny': int, 'compliment_writer': int, 'compliment_photos': int},
chunksize= size)
phoenix = pd.read_csv('data/phoenix_nightlife.csv')
phoenix.shape
phoenix.head(3)
Rename the review_count column in users: review_count now refers to the number of reviews for the specific business, while num_reviews_written refers to the number of reviews written by that user.
chunk_list = []
for chunk_user in users:
    chunk_user = chunk_user.drop(['name', 'yelping_since', 'funny', 'cool', 'fans', 'elite', 'compliment_hot',
                                  'compliment_more', 'compliment_profile', 'compliment_cute', 'compliment_list',
                                  'compliment_note', 'compliment_plain', 'compliment_cool', 'compliment_funny',
                                  'compliment_writer', 'compliment_photos'], axis= 1)
    chunk_user = chunk_user.rename(columns= {'review_count': 'num_reviews_written'})
    chunk_merged = pd.merge(phoenix, chunk_user, on= 'user_id', how= 'inner')
    print(f"{chunk_merged.shape[0]} out of {size:,} related users.")
    chunk_list.append(chunk_merged)
Concatenate data frame "pieces" into one data frame.
phoenix_nightlife_user = pd.concat(chunk_list, ignore_index= True, join= 'outer', axis= 0)
phoenix_nightlife_user.shape
phoenix_nightlife_user.head(3)
phoenix_nightlife_user = phoenix_nightlife_user.drop('experience', axis= 1)
phoenix_nightlife_user.shape
Positive experience = 1, Negative experience = 0
mappings = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
phoenix_nightlife_user['experience'] = phoenix_nightlife_user.review_stars.map(mappings)
phoenix_nightlife_user.experience.sample(5)
plt.title('Distribution of average star ratings of users')
plt.hist(phoenix_nightlife_user.average_stars);
plt.title('Distribution of number of reviews written per user')
plt.yscale('log')
plt.hist(phoenix_nightlife_user.num_reviews_written);
plt.title('Distribution of number of reviews per business')
plt.hist(phoenix_nightlife_user.review_count);
IQR = 75th percentile - 25th percentile (Q3 - Q1). Outlier (left tail): below Q1 - 1.5 × IQR. Outlier (right tail): above Q3 + 1.5 × IQR.
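A compact illustration of the IQR rule on a toy series (note that the lower and upper bounds are combined with logical AND, so only values inside both bounds are kept):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep values inside [lower, upper]; the 100 falls outside and is dropped
kept = s[(s >= lower) & (s <= upper)]
print(kept.max())  # 5
```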
Calculate IQR for the columns we intend to "clean".
IQR_stars = phoenix_nightlife_user.average_stars.quantile(0.75) - phoenix_nightlife_user.average_stars.quantile(0.25)
IQR_num_rev_wri = phoenix_nightlife_user.num_reviews_written.quantile(0.75) - phoenix_nightlife_user.num_reviews_written.quantile(0.25)
IQR_rev_count = phoenix_nightlife_user.review_count.quantile(0.75) - phoenix_nightlife_user.review_count.quantile(0.25)
Calculate thresholds for 3 variables to classify outliers.
outlier_stars = IQR_stars * 1.5
outlier_num_rev_wri = IQR_num_rev_wri * 1.5
outlier_rev_count = IQR_rev_count * 1.5
Calculate Q1 and Q3 for 3 variables
Q1_stars = phoenix_nightlife_user.average_stars.quantile(0.25)
Q3_stars = phoenix_nightlife_user.average_stars.quantile(0.75)
Q1_num = phoenix_nightlife_user.num_reviews_written.quantile(0.25)
Q3_num = phoenix_nightlife_user.num_reviews_written.quantile(0.75)
Q1_rev = phoenix_nightlife_user.review_count.quantile(0.25)
Q3_rev = phoenix_nightlife_user.review_count.quantile(0.75)
#Keep only rows that fall inside the IQR bounds on all three variables
phoenix_nightlife_user = phoenix_nightlife_user[(phoenix_nightlife_user.average_stars >= Q1_stars - outlier_stars) &
                                                (phoenix_nightlife_user.average_stars <= Q3_stars + outlier_stars) &
                                                (phoenix_nightlife_user.num_reviews_written >= Q1_num - outlier_num_rev_wri) &
                                                (phoenix_nightlife_user.num_reviews_written <= Q3_num + outlier_num_rev_wri) &
                                                (phoenix_nightlife_user.review_count >= Q1_rev - outlier_rev_count) &
                                                (phoenix_nightlife_user.review_count <= Q3_rev + outlier_rev_count)]
phoenix_nightlife_user.shape
name = 'data/phoenix_users.csv'
phoenix_nightlife_user.to_csv(name, index= False)
plt.title('Distribution of average star ratings of users')
plt.hist(phoenix_nightlife_user.average_stars);
plt.title('Distribution of number of reviews written per user')
plt.yscale('log')
plt.hist(phoenix_nightlife_user.num_reviews_written);
plt.title('Distribution of number of reviews per business')
plt.hist(phoenix_nightlife_user.review_count);
path_phoenix_nightlife = '/content/drive/MyDrive/DSBA_Project/yelp_dataset/phoenix_nightlife.csv'
phoenix_nightlife = pd.read_csv(path_phoenix_nightlife)
phoenix_nightlife.head()
len(phoenix_nightlife)
star_rating = phoenix_nightlife.groupby(['business_id', 'name', 'city', 'review_count'])['stars'].mean()
star_rating = pd.DataFrame(star_rating).sort_values(by=['stars', 'review_count'], ascending = [False, False])
star_rating.reset_index(inplace=True)
star_rating
In the table above, we get an overview of the highest and lowest rated businesses and their respective review counts. To make our analysis more valid, we only consider businesses with at least 1,000 reviews when assessing the top rated businesses. A business with a 5-star rating and a high number of reviews is very likely favoured by more visitors than a business with a 5-star rating but a relatively low review count.
Likewise, when assessing the worst performing businesses, we only consider businesses with at least 100 reviews. These businesses consistently receive bad ratings, as opposed to businesses that have bad ratings but only a handful of reviews.
By filtering in this way, we are able to select the businesses that consistently perform at the highest and lowest level, which provides the foundation for building the perfect business concept in Phoenix.
star_rating_skimmed = star_rating[star_rating.review_count >= 1000]
star_rating_skimmed
sns.barplot(x= star_rating_skimmed.review_count[:20], y= star_rating_skimmed.name[:20], ci= None, hue= star_rating_skimmed.stars[:20]);
plt.title('Top 20 nightlife businesses in Phoenix')
plt.xscale('log')
star_rating_skimmed = star_rating[star_rating.review_count >= 100]
star_rating_skimmed = pd.DataFrame(star_rating_skimmed).sort_values(by=['stars', 'review_count'], ascending = [True, False])
star_rating_skimmed.head(20)
sns.barplot(x= star_rating_skimmed.review_count[:20], y= star_rating_skimmed.name[:20], ci= None, hue= star_rating_skimmed.stars[:20]);
plt.title('Worst 20 nightlife businesses in Phoenix')
plt.xscale('log')
From the two barplots above we can conclude that 'Little Miss BBQ' has a significant review count and still manages to maintain an average star rating of 5. On the contrary, one of the lowest performing entities worth looking at is 'Hooters', which has a review count above 200 but remains the worst performing business. These two companies will be used as case studies in the review insights chapter.
In the table of the worst performing businesses, Applebee's Grill + Bar shows up multiple times; these rows represent multiple entities in Phoenix, as they all have different business IDs.
To find the most popular categories within Phoenix's nightlife, we split the category column, which contains the various categories a business belongs to, into several rows. The businesses then occur in several rows, one per category. In this way, we can count how many reviews fall under each category. See the example below with Spoke & Wheel:
phoenix_top_cat = phoenix_nightlife.assign(categories = phoenix_nightlife.categories.str.split(', ')).explode('categories')
phoenix_top_cat = phoenix_top_cat[(phoenix_top_cat.categories != 'Bars') & (phoenix_top_cat.categories != 'Nightlife')]
phoenix_top_cat.head()
len(phoenix_top_cat)
The length of the DataFrame went from approx. 211K to 1.4 million rows. Now, we only want businesses with 10 or more reviews, as businesses with very few reviews yield unreliable results.
phoenix_weight = phoenix_top_cat[phoenix_top_cat.review_count >= 10]
print('The top 10 categories in Nightlife of Phoenix:')
phoenix_weight.categories.value_counts()[:10]
Visualization = phoenix_weight.categories.value_counts()[:10]
viz = pd.DataFrame(Visualization)
plt.figure(figsize= (25, 15))
plt.style.use('ggplot')
sns.barplot(x= viz.categories, y= viz.index)
plt.title('The top 10 categories in Nightlife of Phoenix:', fontdict= {'fontsize': 24})
plt.xlabel('Number of reviews', fontdict= {'fontsize': 18})
plt.tick_params(labelsize= 16)
plt.savefig('categories_nightlife.png')
From the table above, we can conclude that 'Restaurants' is the most popular business concept reviewed, specifically American (New) themed restaurants. Secondly, bars score highly; cocktail bars in particular are popular in Phoenix.
Assessing if experience has a significant effect on whether or not a business stays open
So how important are reviews for a business in Phoenix? Do they define whether a business will remain open or not? With these questions in mind, we want to test the probability that a business is closed based on experience.
phoenix_weight.is_open.value_counts(normalize= True)
pd.crosstab(phoenix_weight.is_open, phoenix_weight.experience, normalize= True)
We compute a simple cross tabulation of the two factors: whether a business is open, and the overall experience. The result is a frequency table, and here we can see that there is a significant proportion of positive experiences among the businesses that closed down, compared to negative experiences. Thus, there is no clear correlation between the experience and whether the business stays open or not.
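Eyeballing the frequency table can be complemented by a formal independence check. A sketch using a chi-squared test on a hypothetical 2×2 table of counts (the numbers below are made up, not from the crosstab):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts of is_open (rows) vs. experience (columns)
table = pd.DataFrame({0: [120, 400], 1: [380, 1100]}, index=[0, 1])

# chi2_contingency tests whether the two factors are independent
chi2, p, dof, expected = chi2_contingency(table)
print(p)  # a large p-value would be consistent with "no correlation"
```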
This section inspects the reviews of Phoenix nightlife businesses that reflect a positive experience (a review rating of 4 stars or higher).
We use both kinds of vectorizers that sklearn offers and apply the SVD algorithm to both document term matrices to see if the outcome is similar. We expect the outcomes to be very similar.
phoenix = pd.read_csv('data/phoenix_users.csv')
phoenix_pos = phoenix[phoenix.experience == 1]
phoenix_pos.head(3)
Pick a random sample of n = 5000 reviews to save memory. We rely on the simple random sample assumption (a sufficiently random sample can be representative of the population it is taken from).
text = phoenix_pos.text.sample(5000, random_state= 42)
We create the document term matrix here, that we will decompose later with the intent to find dependencies in it.
c_vectorizer = CountVectorizer(stop_words= 'english') #c_ = CountVectorizer
dtm_c = c_vectorizer.fit_transform(text).toarray()
vocab_c = np.array(c_vectorizer.get_feature_names_out())
tf_vectorizer = TfidfVectorizer(stop_words= 'english') #tf_ = TfidfVectorizer
dtm_t = tf_vectorizer.fit_transform(text).toarray()
vocab_t = np.array(tf_vectorizer.get_feature_names_out())
# Check shape of document term matrix
print(f'The shape of the document term matrix is : {dtm_c.shape} and the number of tokens in the vocabulary is : {len(vocab_c)}.')
dtm_c.shape == dtm_t.shape
We use a simple helper function to help extract the top words from the "abstract" topics.
# helper function
def show_topics(V, vocab):
    top_words = lambda x: [vocab[i] for i in np.argsort(x)[:-num_top_words-1:-1]]
    topic_words = [top_words(x) for x in V]
    return [' '.join(x) for x in topic_words]
This is a low rank approximation algorithm (we try to "recreate" the column space of our original matrix with a smaller matrix) and we compute a full svd on the smaller matrix.
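As a sanity check on the idea of low-rank approximation, here is a tiny demonstration on a hypothetical matrix of exact rank 1: the rank-1 randomized SVD reconstructs it (up to numerical precision).

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Toy "document-term" matrix whose rows are all multiples of [1, 0, 2],
# i.e. a matrix of rank 1
dtm = np.array([[2., 0., 4.],
                [1., 0., 2.],
                [3., 0., 6.]])

U, s, Vh = randomized_svd(dtm, n_components=1, random_state=42)
approx = (U * s) @ Vh  # rank-1 reconstruction

print(np.allclose(approx, dtm))  # True: rank-1 SVD recovers a rank-1 matrix
```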
d = 5 # number of topics
num_top_words = 10 # number of top words
U_c, s_c, Vh_c = utils.extmath.randomized_svd(dtm_c, d, random_state= 42)
U_t, s_t, Vh_t = utils.extmath.randomized_svd(dtm_t, d, random_state= 42)
Topics from the document term matrix by CountVectorizer.
show_topics(Vh_c, vocab_c)
Topics from the document term matrix by TfidfVectorizer.
show_topics(Vh_t, vocab_t)
As the two produced very similar outcomes, we only save one list. We will import this list later when we create a visual summary of this finding.
positive_top_topics = show_topics(Vh_t, vocab_t)
name = 'data/positive_top_words.csv'
pd.Series(positive_top_topics).to_csv(name, index= False)
This section inspects the reviews of Phoenix nightlife businesses that reflect a negative experience (a review rating of 3 stars or lower).
We use both kinds of vectorizers that sklearn offers and apply the SVD algorithm to both document term matrices to see if the outcome is similar. We expect the outcomes to be very similar.
phoenix = pd.read_csv('data/phoenix_users.csv')
phoenix_neg = phoenix[phoenix.experience == 0]
phoenix_neg.head(3)
Pick a random sample of n = 5000 reviews to save memory. We rely on the simple random sample assumption (a sufficiently random sample can be representative of the population it is taken from).
text = phoenix_neg.text.sample(5000, random_state= 42)
We create the document term matrix here, that we will decompose later with the intent to find dependencies in it.
c_vectorizer = CountVectorizer(stop_words= 'english') #c_ = CountVectorizer
dtm_c = c_vectorizer.fit_transform(text).toarray()
vocab_c = np.array(c_vectorizer.get_feature_names_out())
tf_vectorizer = TfidfVectorizer(stop_words= 'english') #tf_ = TfidfVectorizer
dtm_t = tf_vectorizer.fit_transform(text).toarray()
vocab_t = np.array(tf_vectorizer.get_feature_names_out())
# Check shape of document term matrix
print(f'The shape of the document term matrix is : {dtm_c.shape} and the number of tokens in the vocabulary is : {len(vocab_c)}.')
dtm_c.shape == dtm_t.shape
We use the same helper function that we used above to extract the top words.
# helper function
def show_topics(V, vocab):
    top_words = lambda x: [vocab[i] for i in np.argsort(x)[:-num_top_words-1:-1]]
    topic_words = [top_words(x) for x in V]
    return [' '.join(x) for x in topic_words]
This is a low rank approximation algorithm (we try to "recreate" the column space of our original matrix with a smaller matrix) and we compute a full svd on the smaller matrix.
d = 5 # number of topics
num_top_words = 10 # number of top words
U_c, s_c, Vh_c = utils.extmath.randomized_svd(dtm_c, d, random_state= 42)
U_t, s_t, Vh_t = utils.extmath.randomized_svd(dtm_t, d, random_state= 42)
Topics from the document term matrix created by CountVectorizer
show_topics(Vh_c, vocab_c)
Topics from the document term matrix created by TfidfVectorizer
show_topics(Vh_t, vocab_t)
As the two produced very similar outcomes, we only save one list. We save it so that we can visualize the conclusion later.
negative_top_topics = show_topics(Vh_t, vocab_t)
name = 'data/negative_top_words.csv'
pd.Series(negative_top_topics).to_csv(name, index= False)
In this section, we look into the differences between businesses at the different "levels" of rating they achieved.
We use the TfidfVectorizer and apply the NMF algorithm to the document term matrix. We create a separate document term matrix for each review star rating to see if we can "pick up" differences among the reviews.
phoenix = pd.read_csv('data/phoenix_users.csv')
phoenix_1 = phoenix[phoenix.review_stars == 1]
phoenix_2 = phoenix[phoenix.review_stars == 2]
phoenix_3 = phoenix[phoenix.review_stars == 3]
phoenix_4 = phoenix[phoenix.review_stars == 4]
phoenix_5 = phoenix[phoenix.review_stars == 5]
phoenix_1.head(3)
Pick a random sample of n = 5000 reviews to save memory (simple random sample assumption).
text_1 = phoenix_1.text.sample(5000, random_state= 42)
text_2 = phoenix_2.text.sample(5000, random_state= 42)
text_3 = phoenix_3.text.sample(5000, random_state= 42)
text_4 = phoenix_4.text.sample(5000, random_state= 42)
text_5 = phoenix_5.text.sample(5000, random_state= 42)
We create the document term matrices here for the different "subgroups".
vectorizer1 = TfidfVectorizer(stop_words= 'english')
dtm_1 = vectorizer1.fit_transform(text_1).toarray()
vocab_1 = np.array(vectorizer1.get_feature_names_out())
vectorizer2 = TfidfVectorizer(stop_words= 'english')
dtm_2 = vectorizer2.fit_transform(text_2).toarray()
vocab_2 = np.array(vectorizer2.get_feature_names_out())
vectorizer3 = TfidfVectorizer(stop_words= 'english')
dtm_3 = vectorizer3.fit_transform(text_3).toarray()
vocab_3 = np.array(vectorizer3.get_feature_names_out())
vectorizer4 = TfidfVectorizer(stop_words= 'english')
dtm_4 = vectorizer4.fit_transform(text_4).toarray()
vocab_4 = np.array(vectorizer4.get_feature_names_out())
vectorizer5 = TfidfVectorizer(stop_words= 'english')
dtm_5 = vectorizer5.fit_transform(text_5).toarray()
vocab_5 = np.array(vectorizer5.get_feature_names_out())
Again, we use the helper function to extract the top 10 words from the topics.
# helper function
def show_topics(H, vocab):
top_words = lambda x: [vocab[i] for i in np.argsort(x)[:-num_top_words-1:-1]]
topic_words = ([top_words(x) for x in H])
return [' '.join(x) for x in topic_words]
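To make the helper's argsort logic concrete, here is a standalone toy run with made-up weights and vocabulary (the real notebook uses `num_top_words = 10`; we use 3 here to keep the example short):

```python
import numpy as np

num_top_words = 3  # module-level setting the helper relies on

def show_topics(H, vocab):
    # For each topic row, take the indices of the highest weights,
    # map them to vocabulary words and join into a string.
    top_words = lambda x: [vocab[i] for i in np.argsort(x)[:-num_top_words-1:-1]]
    topic_words = [top_words(x) for x in H]
    return [' '.join(x) for x in topic_words]

# Toy check: the weights pick out the highest-scoring vocabulary entries.
vocab = np.array(['bbq', 'beer', 'music', 'queue', 'staff'])
H = np.array([[0.9, 0.1, 0.0, 0.2, 0.8]])
print(show_topics(H, vocab))  # → ['bbq staff queue']
```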
NMF is not an exact decomposition, and it is a comparatively recent technique; researchers have worked on such factorization algorithms for decades. We used sklearn's implementation (others exist too).
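A toy sketch with made-up numbers illustrates the approximate nature of the factorization: NMF yields non-negative factors W and H such that V is only approximately equal to W @ H.

```python
import numpy as np
from sklearn.decomposition import NMF

# Small made-up non-negative matrix standing in for a document term matrix.
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0]])

model = NMF(n_components=2, max_iter=500, random_state=42)
W = model.fit_transform(V)   # document-topic weights
H = model.components_        # topic-word weights

print((W >= 0).all() and (H >= 0).all())  # factors are non-negative
print(float(np.abs(V - W @ H).max()))     # reconstruction error is small, not zero
```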
d = 5 # number of topics
num_top_words = 10 # number of top words
nnmf_1 = decomposition.NMF(n_components= d, max_iter= 500, random_state= 42)
W_1 = nnmf_1.fit_transform(dtm_1)
H_1 = nnmf_1.components_
nnmf_2 = decomposition.NMF(n_components= d, max_iter= 500, random_state= 42)
W_2 = nnmf_2.fit_transform(dtm_2)
H_2 = nnmf_2.components_
nnmf_3 = decomposition.NMF(n_components= d, max_iter= 500, random_state= 42)
W_3 = nnmf_3.fit_transform(dtm_3)
H_3 = nnmf_3.components_
nnmf_4 = decomposition.NMF(n_components= d, max_iter= 500, random_state= 42)
W_4 = nnmf_4.fit_transform(dtm_4)
H_4 = nnmf_4.components_
nnmf_5 = decomposition.NMF(n_components= d, max_iter= 500, random_state= 42)
W_5 = nnmf_5.fit_transform(dtm_5)
H_5 = nnmf_5.components_
Topics from 1 star reviews
show_topics(H_1, vocab_1)
Topics from 2 star reviews
show_topics(H_2, vocab_2)
Topics from 3 star reviews
show_topics(H_3, vocab_3)
Topics from 4 star reviews
show_topics(H_4, vocab_4)
Topics from 5 star reviews
show_topics(H_5, vocab_5)
We save the topic words into CSVs, that we later import to create word clouds.
one_top_topics = show_topics(H_1, vocab_1)
two_top_topics = show_topics(H_2, vocab_2)
three_top_topics = show_topics(H_3, vocab_3)
four_top_topics = show_topics(H_4, vocab_4)
five_top_topics = show_topics(H_5, vocab_5)
name1 = 'data/one_top_words.csv'
pd.Series(one_top_topics).to_csv(name1, index= False)
name2 = 'data/two_top_words.csv'
pd.Series(two_top_topics).to_csv(name2, index= False)
name3 = 'data/three_top_words.csv'
pd.Series(three_top_topics).to_csv(name3, index= False)
name4 = 'data/four_top_words.csv'
pd.Series(four_top_topics).to_csv(name4, index= False)
name5 = 'data/five_top_words.csv'
pd.Series(five_top_topics).to_csv(name5, index= False)
In this section we visualize the findings of the topic modeling tasks described above.
We can see that all the imports have the same shape: 5 rows (topics) and 1 column (the 10 top words of each topic).
# Import all CSVs at once as the process of creating wordcloud is the same for all CSVs
five = pd.read_csv('data/five_top_words.csv')
four = pd.read_csv('data/four_top_words.csv')
three = pd.read_csv('data/three_top_words.csv')
two = pd.read_csv('data/two_top_words.csv')
one = pd.read_csv('data/one_top_words.csv')
positive = pd.read_csv('data/positive_top_words.csv')
negative = pd.read_csv('data/negative_top_words.csv')
# Check shapes of data frames, to make sure import worked
five.shape, four.shape, three.shape, two.shape, one.shape, positive.shape, negative.shape
We transform the lists to long strings, as the WordCloud function only accepts strings.
# Concatenate top topic words into string
text_five = " ".join(a for a in five.text)
text_four = " ".join(b for b in four.text)
text_three = " ".join(c for c in three.text)
text_two = " ".join(d for d in two.text)
text_one = " ".join(e for e in one.text)
text_positive = " ".join(f for f in positive.text)
text_negative = " ".join(g for g in negative.text)
Wordcloud 1 stars
wc1 = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_one)
plt.figure(figsize=[20, 10])
plt.imshow(wc1, interpolation= 'bilinear')
plt.axis('off')
plt.show()
Wordcloud 2 stars
wc2 = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_two)
plt.figure(figsize= [20, 10])
plt.imshow(wc2, interpolation= 'bilinear')
plt.axis('off')
plt.show()
Wordcloud 3 stars
wc3 = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_three)
plt.figure(figsize= [20, 10])
plt.imshow(wc3, interpolation= 'bilinear')
plt.axis('off')
plt.show()
Wordcloud 4 stars
wc4 = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_four)
plt.figure(figsize= [20, 10])
plt.imshow(wc4, interpolation= 'bilinear')
plt.axis('off')
plt.show()
Wordcloud 5 stars
wc5 = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_five)
plt.figure(figsize= [20, 10])
plt.imshow(wc5, interpolation= 'bilinear')
plt.axis('off')
plt.show()
Wordcloud positive experience
wc6 = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_positive)
plt.figure(figsize= [20, 10])
plt.imshow(wc6, interpolation= 'bilinear')
plt.axis('off')
plt.show()
Wordcloud negative experience
wc7 = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_negative)
plt.figure(figsize= [20, 10])
plt.imshow(wc7, interpolation= 'bilinear')
plt.axis('off')
plt.show()
In this section we inspect Little Miss BBQ, as this business is well liked in Phoenix.
df = pd.read_csv('data/phoenix_users.csv')
df.head(3)
Data Frame that only contains entries for Little Miss BBQ
# Subset data set
little_miss = df[df.name == 'Little Miss BBQ']
little_miss.head(3)
We create the document term matrix here. The shape of the matrix (2387 x 7806) can be seen below along with the length of the vocabulary (7806). This means that we have 2387 reviews (documents) for Little Miss BBQ and got 7806 tokens after vectorization.
# get reviews
text = little_miss.text
# vectorize
vectorizer = TfidfVectorizer(stop_words= 'english')
dtm = vectorizer.fit_transform(text).toarray()
vocab = np.array(vectorizer.get_feature_names())
# shape of dtm
dtm.shape, len(vocab)
Helper function to extract top words
# Helper function
def show_topics(H, vocab):
top_words = lambda x: [vocab[i] for i in np.argsort(x)[:-num_top_words-1:-1]]
topic_words = ([top_words(x) for x in H])
return [' '.join(x) for x in topic_words]
We used non-negative matrix factorization (NMF).
d = 5 # number of topics
num_top_words = 10 # number of top words
nnmf = decomposition.NMF(n_components= d, max_iter= 500, random_state= 42)
W = nnmf.fit_transform(dtm)
H = nnmf.components_
Topics extracted
show_topics(H, vocab)
words = show_topics(H, vocab)
# join list elements into one string
text_little_miss = " ".join(a for a in words)
wc = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_little_miss)
plt.figure(figsize=[20, 10])
plt.imshow(wc, interpolation= 'bilinear')
plt.axis('off')
plt.show()
In this section we have a closer look at one of the worst performing businesses in Phoenix.
data = pd.read_csv('data/phoenix_users.csv')
# Subset data set
hooters = data[data.name == 'Hooters']
hooters.head(3)
We create the document term matrix here. The shape of the matrix (478 x 3683) can be seen below along with the length of the vocabulary (3683). This means that we have 478 reviews (documents) for Hooters and got 3683 tokens after vectorization.
text = hooters.text
# vectorize text
vectorizer = TfidfVectorizer(stop_words= 'english')
dtm = vectorizer.fit_transform(text).toarray()
vocab = np.array(vectorizer.get_feature_names())
# check dtm and vocab size
dtm.shape, len(vocab)
Helper function to extract top words
# Helper function
def show_topics(H, vocab):
top_words = lambda x: [vocab[i] for i in np.argsort(x)[:-num_top_words-1:-1]]
topic_words = ([top_words(x) for x in H])
return [' '.join(x) for x in topic_words]
We used non-negative matrix factorization (NMF).
d = 5 # number of topics
num_top_words = 10 # number of top words
nnmf = decomposition.NMF(n_components= d, max_iter= 500, random_state= 42)
W = nnmf.fit_transform(dtm)
H = nnmf.components_
Topics top words for Hooters
show_topics(H, vocab)
words = show_topics(H, vocab)
# concatenate string elements into a string
text_hooters = " ".join(a for a in words)
wc = WordCloud(max_font_size= 50, max_words= 40, background_color= 'white').generate(text_hooters)
plt.figure(figsize=[20, 10])
plt.imshow(wc, interpolation= 'bilinear')
plt.axis('off')
plt.show()
The outcome of the topic modeling corresponds to our intuition that businesses get bad reviews if their staff is rude, the service is slow and the quality of the product is sub-optimal. Words such as "horrible", "terrible" and "rude" were found among the 1 star reviews, indicating that the staff was potentially rude and the overall experience was bad. The 2 star reviews contained words such as "slow" and "hour", which suggests that these businesses were down-rated because of the speed of their service. Reviews in the 4 and 5 star ranges seemed to highlight that the staff was friendly, the product was fresh and of good quality, and the service was fast. The same tendency was found in the positive vs. negative experience topic modeling.
As "Little Miss BBQ" was the best performing business in Phoenix, we looked into its reviews to see what could explain its success. The service appears to be friendly, and the food (especially the meat products) is praised for its tastiness. To get a more concrete view of what makes a business one of the "worst performing", we looked into "Hooters", as its rating is 2 stars and a large number of reviews agree on that value. Its rating can potentially be explained by slow service and subpar food.
In this section, we create a classifier that predicts the experience of the reviewer.
phoenix_sentiment = pd.read_csv('data/phoenix_users.csv')
phoenix_sentiment.head(3)
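The randomized search further below relies on `X_train`/`y_train` from a train/test split created earlier in the notebook. A minimal sketch of such a split (using a hypothetical mini data frame in place of `phoenix_sentiment`; the column names `text` and `experience` follow the data set described above) might look like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical mini data set standing in for phoenix_sentiment;
# 'experience' is 1 for a positive review, 0 for a negative one.
df = pd.DataFrame({
    'text': ['great food', 'rude staff', 'loved it', 'way too slow'],
    'experience': [1, 0, 1, 0],
})

# Hold out 25% of the rows as a test set, fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    df.text, df.experience, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))  # → 3 1
```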
Pipeline allows us to combine preprocessing and modelling in one object (which is useful, as we can cross-validate different preprocessing methods together with different candidate models).
# vectorizer
vectorizer = CountVectorizer(stop_words= 'english')
# model
classifier = MultinomialNB()
# pipeline
pipe = Pipeline([('vect', vectorizer), ('class', classifier)])
This is a method to search through the space of different combinations of hyper-parameters. We used the randomized method to decrease the amount of computation needed. We could instead have used GridSearchCV, but then we would have had to search through 288 (6 × 4 × 6 × 2) parameter combinations. The cost of less computation is a slight chance that RandomizedSearchCV does not find the globally best hyper-parameter combination, but this is rather unlikely.
# dictionary of hyper parameter options
params = {}
params['vect__ngram_range'] = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
params['vect__max_df'] = [1.0, 0.99, 0.98, 0.97]
params['class__alpha'] = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
params['class__fit_prior'] = [True, False]
RandomizedSearchCV that "tells" us the best hyper-parameter combination.
CV = RandomizedSearchCV(pipe, params, cv= 3, scoring= 'accuracy')
CV.fit(X_train, y_train)
print('The best score and parameter combination is the following: ')
CV.best_score_, CV.best_params_
# Build best performing vectorizer and model into a pipeline
pipe_best = make_pipeline(CountVectorizer(stop_words= 'english', ngram_range= (1, 2), max_df= 0.99),
MultinomialNB(alpha= 0.7))
# fit pipeline with training data
pipe_best.fit(X_train, y_train)
Get baseline accuracy
# Get predictions for the test set
y_pred = pipe_best.predict(X_test)
phoenix_sentiment.experience.value_counts(normalize= True)
If we predict positive sentiment for all entries, we get it right 70% of the time.
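The 70% figure is the majority-class baseline any useful model must beat; a toy illustration of the idea with made-up labels:

```python
import numpy as np

# Majority-class baseline: predict the most frequent label for every entry.
y_true = np.array([1, 1, 1, 0, 0])       # toy labels, 60% positive
majority = np.bincount(y_true).argmax()  # most frequent class (here: 1)
baseline_acc = (np.full_like(y_true, majority) == y_true).mean()
print(baseline_acc)  # → 0.6
```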
Get accuracy of pipeline
metrics.accuracy_score(y_test, y_pred)
Confusion matrix
metrics.confusion_matrix(y_test, y_pred)
Assign the different entries of the matrix to variables, to create other performance evaluation metrics below.
confusion = metrics.confusion_matrix(y_test, y_pred)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
Summary of model performance
# Derive the metrics from the confusion matrix entries defined above
sen, spe = TP / (TP + FN), TN / (TN + FP)   # sensitivity, specificity
fpr, prec = FP / (FP + TN), TP / (TP + FP)  # false positive rate, precision
data = {'Sensitivity': sen, 'Specificity': spe, 'False positive rate': fpr, 'Precision': prec}
pd.DataFrame(data= data, index= [0], dtype= np.float32)
The model does a good job of identifying true positives and an "OK" job of identifying true negatives. We are happy with the model performance, as the precision (positive predictive value) is close to 90%.
Below, the intended use case is demonstrated. A business would write a hypothetical review and check what the model predicts it to be, and accordingly include or exclude that potential feature from its business plan. Label 1 corresponds to a positive experience, 0 to a negative experience.
test = 'Friendly staff and awesome decoration, loved it'
test1 = 'Rude service, expensive drinks'
test2 = 'The drink was cheap but the dance floor was way too small'
test3 = 'The drink was cheap and the dance floor was large'
test4 = 'The music was quite loud and got me to dance'
pred = pipe_best.predict([test])
pred1 = pipe_best.predict([test1])
pred2 = pipe_best.predict([test2])
pred3 = pipe_best.predict([test3])
pred4 = pipe_best.predict([test4])
print(f"{test} is {pred}")
print(f"{test1} is {pred1}")
print(f"{test2} is {pred2}")
print(f"{test3} is {pred3}")
print(f"{test4} is {pred4}")
We used the combination of pipelines and randomized search cross-validation; according to our studies, this is a powerful way to develop proficient models. Moreover, we created a test set that was used only for final validation (not even in the randomized search) in order to get the most unbiased estimate of accuracy. We used CountVectorizer to represent text numerically, as MultinomialNB is designed for integer count features and may have problems with non-integer values. We imagine that this model could be useful for businesses to keep track of customers who provide a bad review; the business can then follow up with those customers to improve their experience and, as a consequence, increase the business's overall rating. For this specific use case, the evaluation criterion that should be as high as possible is the specificity, so that the business can reach out to the highest number of customers who potentially had a negative experience.